This question involves the use of simple linear regression on the Auto data set.
Use the lm() function to perform a simple linear regression with mpg as the response and horsepower as the predictor. Use the summary() function to print the results. Comment on the output. For example:
i. Is there a relationship between the predictor and the response?
ii. How strong is the relationship between the predictor and the response?
iii. Is the relationship between the predictor and the response positive or negative?
iv. What is the predicted mpg associated with a horsepower of 98? What are the associated 95% confidence and prediction intervals?
fit <- lm(mpg ~ horsepower, data = Auto)
summary(fit)
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
There is a relationship between the predictor and response: the results suggest that a one unit increase in horsepower is associated with a decrease in mpg of about 0.16, so the relationship is negative. The very low p-value on the coefficient and the high F-statistic indicate that this result is highly statistically significant: we can reject the null hypothesis that there is no relationship between the response and predictor. The \(R^2\) of 0.6 suggests the model explains about 60% of the variance in the data.
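On the question of strength in (ii): in a simple linear regression, \(R^2\) is exactly the squared sample correlation between the response and the predictor. A minimal sketch on simulated data (toy stand-ins for mpg and horsepower, not the Auto data itself):

```r
# R^2 in simple regression equals the squared correlation cor(y, x)^2.
# Toy data only; the coefficients roughly mimic the Auto fit above.
set.seed(42)
hp  <- runif(50, 50, 200)                  # hypothetical horsepower values
mpg <- 40 - 0.16 * hp + rnorm(50, sd = 5)
f <- lm(mpg ~ hp)
all.equal(summary(f)$r.squared, cor(mpg, hp)^2)  # TRUE
```

The same identity is why the \(R^2\) of 0.6 above can be read directly as a measure of the strength of the (linear) relationship.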
The predicted mpg associated with a horsepower of 98 is:
predict(fit, newdata = data.frame(horsepower = 98))
## 1
## 24.46708
The associated 95% confidence interval is:
predict(fit, newdata = data.frame(horsepower = 98), interval = "confidence")
## fit lwr upr
## 1 24.46708 23.97308 24.96108
The associated 95% prediction interval is:
predict(fit, newdata = data.frame(horsepower = 98), interval = "prediction")
## fit lwr upr
## 1 24.46708 14.8094 34.12476
Plot the response and the predictor. Use the abline() function to display the least squares regression line.
plot(Auto$horsepower, Auto$mpg, xlab = "horsepower", ylab = "mpg")
abline(fit)
Use the plot() function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.
par(mfrow = c(2,2))
plot(fit)
The residuals versus fitted plot suggests a non-linear relationship in the data, while the scale-location plot also indicates heteroscedasticity (non-constant variance of residuals). The Q-Q plot shows that the residuals deviate from normality at the top of the distribution, suggesting they are right skewed. In the residuals vs leverage plot, most points lie within an acceptable range for the standardised residuals, so there do not appear to be any significant outliers, although a handful of points seem to have high leverage relative to the rest of the observations.
This question involves the use of multiple linear regression on the Auto data set.
Produce a scatterplot matrix which includes all of the variables in the data set.
pairs(Auto)
Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
cor(Auto[,which(names(Auto) != "name")])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:
i. Is there a relationship between the predictors and the response?
ii. Which predictors appear to have a statistically significant relationship to the response?
iii. What does the coefficient for the year variable suggest?
origin is more properly encoded as a categorical variable, which we can do using factor() before fitting the model.
Auto$origin <- factor(Auto$origin)
fit <- lm(mpg ~ ., data = Auto[,which(names(Auto) != "name")])
summary(fit)
##
## Call:
## lm(formula = mpg ~ ., data = Auto[, which(names(Auto) != "name")])
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.0095 -2.0785 -0.0982 1.9856 13.3608
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.795e+01 4.677e+00 -3.839 0.000145 ***
## cylinders -4.897e-01 3.212e-01 -1.524 0.128215
## displacement 2.398e-02 7.653e-03 3.133 0.001863 **
## horsepower -1.818e-02 1.371e-02 -1.326 0.185488
## weight -6.710e-03 6.551e-04 -10.243 < 2e-16 ***
## acceleration 7.910e-02 9.822e-02 0.805 0.421101
## year 7.770e-01 5.178e-02 15.005 < 2e-16 ***
## origin2 2.630e+00 5.664e-01 4.643 4.72e-06 ***
## origin3 2.853e+00 5.527e-01 5.162 3.93e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.307 on 383 degrees of freedom
## Multiple R-squared: 0.8242, Adjusted R-squared: 0.8205
## F-statistic: 224.5 on 8 and 383 DF, p-value: < 2.2e-16
The high F-statistic and low corresponding p-value suggest that there is a relationship between the predictors and the response. weight, year and origin all appear significant at the 99.9% level given their p-values, while displacement is significant at the 99% level. The coefficient for year suggests that for a car newer by one year, mpg increases by about 0.78, holding the other predictors constant. The RSE is lower than in the single predictor model, suggesting this model fits the data better than the simple regression.
Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
par(mfrow = c(2,2))
plot(fit)
The residuals vs fitted plot shows some evidence of non-linearity, with the residuals showing a slight u-shape. Some of the observations at the top of the range of fitted values have quite high residuals, for instance observation 327. The scale-location plot also suggests non-constant variance of residual errors. The Q-Q plot suggests non-normality at the top of the distribution, with the same right-skew noted in the simple regression fit. The residuals vs leverage plot shows some points with quite high standardised residuals (greater than 3), but none of these have unusually high leverage. Point 14 does have high leverage, but its standardised residual is within a reasonable range.
Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
We could hypothesise that measures of engine power and size may have a non-additive effect, so we could try a fit with an interaction between displacement and horsepower. Similarly, acceleration and weight may have a non-additive effect:
fit2 <- lm(mpg ~ . + horsepower:displacement, data = Auto[,which(names(Auto) != "name")])
fit3 <- lm(mpg ~ . + acceleration:weight, data = Auto[,which(names(Auto) != "name")])
summary(fit2)
##
## Call:
## lm(formula = mpg ~ . + horsepower:displacement, data = Auto[,
## which(names(Auto) != "name")])
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.5801 -1.6052 -0.0798 1.4166 12.5839
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.859e+00 4.397e+00 -0.423 0.67277
## cylinders 6.344e-01 3.026e-01 2.096 0.03672 *
## displacement -7.287e-02 1.141e-02 -6.388 4.88e-10 ***
## horsepower -1.957e-01 2.074e-02 -9.435 < 2e-16 ***
## weight -3.238e-03 6.650e-04 -4.870 1.64e-06 ***
## acceleration -2.101e-01 9.083e-02 -2.313 0.02127 *
## year 7.439e-01 4.576e-02 16.258 < 2e-16 ***
## origin2 9.690e-01 5.236e-01 1.850 0.06501 .
## origin3 1.395e+00 5.065e-01 2.754 0.00616 **
## displacement:horsepower 5.176e-04 4.916e-05 10.530 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.915 on 382 degrees of freedom
## Multiple R-squared: 0.8637, Adjusted R-squared: 0.8605
## F-statistic: 269.1 on 9 and 382 DF, p-value: < 2.2e-16
summary(fit3)
##
## Call:
## lm(formula = mpg ~ . + acceleration:weight, data = Auto[, which(names(Auto) !=
## "name")])
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.7826 -1.9450 -0.0021 1.6532 12.4624
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.393e+01 5.836e+00 -7.528 3.75e-13 ***
## cylinders -2.162e-01 3.063e-01 -0.706 0.480788
## displacement 6.888e-03 7.657e-03 0.900 0.368933
## horsepower -4.199e-02 1.342e-02 -3.129 0.001891 **
## weight 3.633e-03 1.638e-03 2.218 0.027146 *
## acceleration 1.599e+00 2.414e-01 6.623 1.19e-10 ***
## year 8.037e-01 4.911e-02 16.364 < 2e-16 ***
## origin2 2.054e+00 5.421e-01 3.789 0.000176 ***
## origin3 2.081e+00 5.347e-01 3.893 0.000117 ***
## weight:acceleration -5.717e-04 8.383e-05 -6.820 3.57e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.126 on 382 degrees of freedom
## Multiple R-squared: 0.8433, Adjusted R-squared: 0.8396
## F-statistic: 228.4 on 9 and 382 DF, p-value: < 2.2e-16
In both cases, the interaction terms are highly significant, as indicated by their very small p-values.
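An interaction's contribution can equivalently be assessed with a nested-model F-test via anova(), comparing the main-effects fit against the fit that adds the interaction. A small self-contained sketch on simulated data (not the Auto set):

```r
# Nested-model F-test for an interaction term (toy data).
set.seed(7)
n <- 200
a <- rnorm(n); b <- rnorm(n)
yy <- 1 + a + b + 0.5 * a * b + rnorm(n)   # true model includes an interaction
base <- lm(yy ~ a + b)
full <- lm(yy ~ a * b)                     # a * b expands to a + b + a:b
anova(base, full)                          # small p-value => keep a:b
```

The p-value in the anova() table matches the t-test p-value on the interaction coefficient in summary(full), since the models differ by a single term.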
Try a few different transformations of the variables, such as \(log(X)\), \(\sqrt{X}\), \(X^2\). Comment on your findings.
The pairwise plots suggest non-linear relationships between mpg and each of horsepower, weight, displacement and year. Given that the other models showed heteroscedasticity, we can also apply a log transformation to the response, which should pull in the more extreme values:
fit4 <- lm(log(mpg) ~ I(horsepower^2) + I(weight^2) + I(displacement^2) + sqrt(year) + ., data = Auto[,-9])
summary(fit4)
##
## Call:
## lm(formula = log(mpg) ~ I(horsepower^2) + I(weight^2) + I(displacement^2) +
## sqrt(year) + ., data = Auto[, -9])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.38667 -0.06680 0.00696 0.06152 0.34173
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.553e+01 1.090e+01 5.096 5.47e-07 ***
## I(horsepower^2) 7.959e-06 5.233e-06 1.521 0.12910
## I(weight^2) 9.590e-09 1.347e-08 0.712 0.47677
## I(displacement^2) 3.753e-06 1.345e-06 2.790 0.00554 **
## sqrt(year) -1.227e+01 2.503e+00 -4.903 1.40e-06 ***
## cylinders 7.141e-03 1.249e-02 0.572 0.56787
## displacement -1.936e-03 8.007e-04 -2.418 0.01606 *
## horsepower -5.096e-03 1.548e-03 -3.292 0.00109 **
## weight -2.266e-04 1.013e-04 -2.237 0.02587 *
## acceleration -7.665e-03 3.779e-03 -2.028 0.04325 *
## year 7.339e-01 1.435e-01 5.113 5.04e-07 ***
## origin2 4.236e-02 2.077e-02 2.040 0.04204 *
## origin3 3.411e-02 2.008e-02 1.698 0.09031 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1087 on 379 degrees of freedom
## Multiple R-squared: 0.9009, Adjusted R-squared: 0.8978
## F-statistic: 287.2 on 12 and 379 DF, p-value: < 2.2e-16
The transformed variables for year and displacement are both significant to at least the 99% level, but the transformed horsepower and weight predictors do not appear to be significant. In this fit, acceleration is significant to the 95% level, while in the fit without the transformed variables it was not significant. origin3 is less significant in this model, with a higher p-value than in the fit without the transformations.
par(mfrow = c(2,2))
plot(fit4)
There is no evidence of non-linearity in the residuals plot, and the scale-location plot shows no signs of heteroscedasticity. However, the Q-Q plot shows some deviation of the residuals from normality, suggesting a fat-tailed distribution. The residuals vs leverage plot shows a handful of points that have higher leverage than most other observations, but their residuals are mostly within a reasonable range.
This question should be answered using the Carseats data set.
Fit a multiple regression model to predict Sales using Price, Urban, and US.
fit <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(fit)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
Provide an interpretation of each coefficient in the model. Be careful — some of the variables in the model are qualitative!
According to the model results, while holding the other predictors constant:
- a unit ($1) increase in Price is associated with a decrease in Sales of about 0.054, i.e. roughly 54 units sold (Sales is measured in thousands);
- an urban location is associated with Sales about 0.022 lower than a non-urban one, although this effect is not statistically significant;
- a store in the US is associated with Sales about 1.2 (roughly 1,200 units) higher than a store outside the US.
Write out the model in equation form, being careful to handle the qualitative variables properly.
The model can be written as follows (assuming Urban is coded as 1 for an urban location and 0 otherwise, and US is coded 1 for ‘Yes’ and 0 for ‘No’): \[
Sales=13.043469-0.054459*Price-0.021916*Urban+1.200573*US+\epsilon
\]
For which of the predictors can you reject the null hypothesis \(H_0:\beta_j=0\)?
Based on the p-values for the coefficients, we can reject the null hypothesis for Price and US.
On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
fit2 <- lm(Sales ~ Price + US, data = Carseats)
summary(fit2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
How well do the models in (a) and (e) fit the data?
Both models have an \(R^2\) of 0.2393, explaining about 24% of the variance in the data. The second model has a slightly lower RSE. The second model is simpler and fits the data at least as well as the first, so we should prefer the second model.
Using the model from (e), obtain 95% confidence intervals for the coefficient(s).
confint(fit2)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
Is there evidence of outliers or high leverage observations in the model from (e)?
par(mfrow = c(2,2))
plot(fit2)
The diagnostic plots show a handful of points with quite high standardised residuals (greater than three in absolute value), but none of these have high leverage so may not affect the fit very much. Those that do have high leverage relative to other observations have small residuals, so are not outliers.
In this problem we will investigate the t-statistic for the null hypothesis \(H_0:\beta=0\) in simple linear regression without an intercept. To begin, we generate a predictor \(x\) and a response \(y\) as follows. Generate a predictor and response:
set.seed(1)
x <- rnorm(100)
y <- 2*x + rnorm(100)
Perform a simple linear regression of y onto x, without an intercept. Report the coefficient estimate \(\hat{\beta}\), the standard error of this coefficient estimate, and the t-statistic and p-value associated with the null hypothesis \(H_0:\beta=0\). Comment on these results. (You can perform regression without an intercept using the command lm(y∼x+0).)
lm.fit <- lm(y ~ x + 0)
summary(lm.fit)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9154 -0.6472 -0.1771 0.5056 2.3109
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 1.9939 0.1065 18.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9586 on 99 degrees of freedom
## Multiple R-squared: 0.7798, Adjusted R-squared: 0.7776
## F-statistic: 350.7 on 1 and 99 DF, p-value: < 2.2e-16
The high t-value and low p-value indicate a high level of significance, so we can reject \(H_0\). The coefficient estimate of 1.9939 is very close to the true value of 2.
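The estimate and its standard error can also be computed directly from the closed-form expressions for the no-intercept fit; this sketch regenerates the same data so it stands alone, and should reproduce the summary() figures above:

```r
# Closed-form no-intercept estimates, matching summary(lm(y ~ x + 0)).
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
beta_hat <- sum(x * y) / sum(x^2)                       # equation (3.38)
se_hat <- sqrt(sum((y - x * beta_hat)^2) / ((length(x) - 1) * sum(x^2)))
round(c(beta = beta_hat, se = se_hat, t = beta_hat / se_hat), 4)
```

Note the degrees of freedom: with no intercept only one parameter is estimated, so the residual variance is divided by \(n-1\) rather than \(n-2\).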
Now perform a simple linear regression of x onto y without an intercept, and report the coefficient estimate, its standard error, and the corresponding t-statistic and p-values associated with the null hypothesis \(H_0:\beta=0\). Comment on these results.
lm.fit1 <- lm(x ~ y + 0)
summary(lm.fit1)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8699 -0.2368 0.1030 0.2858 0.8938
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.39111 0.02089 18.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4246 on 99 degrees of freedom
## Multiple R-squared: 0.7798, Adjusted R-squared: 0.7776
## F-statistic: 350.7 on 1 and 99 DF, p-value: < 2.2e-16
The t-statistic and p-value are identical, so we can again reject \(H_0\); the coefficient estimate and its standard error are both smaller than in the first model.
What is the relationship between the results obtained in (a) and (b)?
Both regressions produce the same t-statistic, \(R^2\) and F-statistic. The coefficient estimate and its standard error in (b) are smaller than in (a) by the same factor, so their ratio \(\hat{\beta}/SE(\hat{\beta})\) is unchanged.
For the regression of \(Y\) onto \(X\) without an intercept, the t-statistic for \(H_0:\beta=0\) takes the form \(\hat{\beta}/SE(\hat{\beta})\), where \(\hat{\beta}\) is given by (3.38), and where \[SE(\hat{\beta}) = \sqrt{\frac{\sum_{i=1}^n(y_i-x_i\hat{\beta})^2}{(n-1)\sum_{i'=1}^nx_{i'}^2}}\] Show algebraically, and confirm numerically in R, that the t-statistic can be written as \[\frac{(\sqrt{n-1})\sum_{i=1}^nx_iy_i}{\sqrt{(\sum_{i=1}^n{y_i^2})(\sum_{i=1}^nx_i^2)-(\sum_{i=1}^nx_iy_i)^2}}\]
Begin by squaring the t-statistic: \[\left(\frac{\hat{\beta}}{SE(\hat{\beta})}\right)^2=\frac{(\sum_{i=1}^n{x_iy_i})^2}{(\sum_{i=1}^n{x_i^2})^2}\cdot{\frac{(n-1)\sum_{i'=1}^nx_{i'}^2}{\sum_{i=1}^n(y_i-x_i\hat{\beta})^2}}\] Expanding the residual sum of squares and substituting \(\hat{\beta}=\sum_{i=1}^nx_iy_i/\sum_{i=1}^nx_i^2\) gives \(\sum_{i=1}^n(y_i-x_i\hat{\beta})^2=\sum_{i=1}^n{y_i^2}-\hat{\beta}^2\sum_{i=1}^nx_i^2=\sum_{i=1}^n{y_i^2}-(\sum_{i=1}^nx_iy_i)^2/\sum_{i=1}^nx_i^2\), so \[\left(\frac{\hat{\beta}}{SE(\hat{\beta})}\right)^2=\frac{(\sum_{i=1}^n{x_iy_i})^2}{(\sum_{i=1}^n{x_i^2})^2}\cdot{\frac{(n-1)\sum_{i'=1}^nx_{i'}^2}{\sum_{i=1}^n{y_i^2}-(\sum_{i=1}^nx_iy_i)^2/\sum_{i=1}^nx_i^2}}={\frac{(n-1)(\sum_{i=1}^n{x_iy_i})^2}{\sum_{i=1}^n{y_i^2}\sum_{i=1}^nx_i^2-(\sum_{i=1}^nx_iy_i)^2}}\] Taking the square root of this expression: \[\frac{\hat{\beta}}{SE(\hat{\beta})}=\frac{(\sqrt{n-1})\sum_{i=1}^nx_iy_i}{\sqrt{\sum_{i=1}^n{y_i^2}\sum_{i=1}^nx_i^2-(\sum_{i=1}^nx_iy_i)^2}}\] We can confirm this result numerically:
calculated_t <- sqrt(length(x) - 1) * sum(x*y) / sqrt(sum(x^2)*sum(y^2) - (sum(x*y))^2)
model_t <- coef(summary(lm.fit1))[, "t value"]
all.equal(calculated_t, model_t)
## [1] TRUE
Using the results from (d), argue that the t-statistic for the regression of y onto x is the same as the t-statistic for the regression of x onto y.
As is clear from the derivation, the t-statistic is symmetric in x and y: it does not matter whether y is regressed on x or vice versa, so the t-statistic is the same for both models. The coefficient estimate and its standard error change by the same factor between the two regressions, which is consistent with this finding.
In R, show that when regression is performed with an intercept, the t-statistic for \(H_0:\beta_1=0\) is the same for the regression of y onto x as it is for the regression of x onto y.
all.equal(coef(summary(lm(y~x)))[2, "t value"], coef(summary(lm(x~y)))[2, "t value"])
## [1] TRUE
This problem involves simple linear regression without an intercept.
Recall that the coefficient estimate \(\hat{\beta}\) for the linear regression of \(Y\) onto \(X\) without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of \(X\) onto \(Y\) the same as the coefficient estimate for the regression of \(Y\) onto \(X\)?
When performing a linear regression without an intercept, the coefficient estimates from regressing \(X\) on \(Y\) and \(Y\) on \(X\) will be the same when the sum of the squared values of \(X\) equals the sum of the squared values of \(Y\), since both estimates equal \(\sum_i x_iy_i\) divided by the respective sum of squares.
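A quick numerical sketch of this condition (fresh simulated data): rescale y so that the two sums of squares match, and the two no-intercept slopes coincide.

```r
# If sum(y^2) == sum(x^2), the no-intercept slopes of y ~ x and x ~ y agree,
# since both equal sum(x * y) divided by the common sum of squares.
set.seed(3)
x <- rnorm(100)
y <- rnorm(100)
y <- y * sqrt(sum(x^2) / sum(y^2))      # force sum(y^2) == sum(x^2)
b_yx <- coef(lm(y ~ x + 0))
b_xy <- coef(lm(x ~ y + 0))
all.equal(unname(b_yx), unname(b_xy))   # TRUE
```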
Generate an example in R with \(n=100\) observations in which the coefficient estimate for the regression of \(X\) onto \(Y\) is different from the coefficient estimate for the regression of \(Y\) onto \(X\).
x <- rnorm(100)
y <- rnorm(100)
lm(x ~ y + 0)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Coefficients:
## y
## 0.1168
lm(y ~ x + 0)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Coefficients:
## x
## 0.1076
Generate an example in R with \(n=100\) observations in which the coefficient estimate for the regression of \(X\) onto \(Y\) is the same as the coefficient estimate for the regression of \(Y\) onto \(X\).
x <- c(rep(2, 25), rep(0,75))
y <- c(rep(5,4), rep(0,96))
lm(x ~ y + 0)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Coefficients:
## y
## 0.4
lm(y ~ x + 0)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Coefficients:
## x
## 0.4
In this exercise you will create some simulated data and will fit simple linear regression models to it. Make sure to use set.seed(1) prior to starting part (a) to ensure consistent results.
Using the rnorm() function, create a vector, x, containing 100 observations drawn from a \(N(0,1)\) distribution. This represents a feature, \(X\).
set.seed(1)
x <- rnorm(100)
Using the rnorm() function, create a vector, eps, containing 100 observations drawn from a \(N(0,0.25)\) distribution i.e. a normal distribution with mean zero and variance 0.25.
eps <- rnorm(100, 0, 0.25)
Using x and eps, generate a vector y according to the model \(Y=-1+0.5X+\epsilon\). What is the length of the vector y? What are the values of \(\beta_0\) and \(\beta_1\) in this linear model?
y <- -1 + 0.5*x + eps
length(y)
## [1] 100
y is of length 100. In the model, \(\beta_0\) is -1 and \(\beta_1\) is 0.5.
Create a scatterplot displaying the relationship between x and y. Comment on what you observe.
plot(x,y)
The scatterplot shows a cluster of values for x around 0, the mean, with more sparse points going up to 3 and -3. y takes on values between -2.5 and 0.5, with the majority of points around -1 coinciding with x at zero. The relationship between the two appears linear with a slope of approximately 0.5 (the range of x is about 6 - between 3 and -3 - and the range of y is about 3 - between -2.5 and 0.5). These observations are consistent with the linear model used to generate y.
Fit a least squares linear model to predict y using x. Comment on the model obtained. How do \(\hat{\beta_0}\) and \(\hat{\beta_1}\) compare to \(\beta_0\) and \(\beta_1\)?
lm.fit <- lm(y ~ x)
summary(lm.fit)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46921 -0.15344 -0.03487 0.13485 0.58654
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.00942 0.02425 -41.63 <2e-16 ***
## x 0.49973 0.02693 18.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2407 on 98 degrees of freedom
## Multiple R-squared: 0.7784, Adjusted R-squared: 0.7762
## F-statistic: 344.3 on 1 and 98 DF, p-value: < 2.2e-16
The intercept of -1.00942 and coefficient of 0.49973 are very close to the values used to generate the data, and both are well within one estimated standard error of the true values. This is reflected in the extremely small p-values.
Display the least squares line on the scatterplot obtained in (d). Draw the population regression line on the plot, in a different color. Use the legend() command to create an appropriate legend.
plot(x,y)
abline(lm.fit, col = "blue")
abline( -1, 0.5, col= "red")
legend(x = "bottomright", legend = c("model", "true"), lwd = 3, col = c("blue", "red"))
Now fit a polynomial regression model that predicts y using x and x^2. Is there evidence that the quadratic term improves the model fit? Explain your answer.
lm.fit2 <- lm(y ~ x + I(x^2))
summary(lm.fit2)
##
## Call:
## lm(formula = y ~ x + I(x^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4913 -0.1563 -0.0322 0.1451 0.5675
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.98582 0.02941 -33.516 <2e-16 ***
## x 0.50429 0.02700 18.680 <2e-16 ***
## I(x^2) -0.02973 0.02119 -1.403 0.164
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2395 on 97 degrees of freedom
## Multiple R-squared: 0.7828, Adjusted R-squared: 0.7784
## F-statistic: 174.8 on 2 and 97 DF, p-value: < 2.2e-16
There is a small increase in model fit, with a slightly higher \(R^2\) and lower RSE. Given that this improvement is marginal and the polynomial term is not statistically significant, this does not represent a major improvement on the original model.
Repeat (a)–(f) after modifying the data generation process in such a way that there is less noise in the data. The model (3.39) should remain the same. You can do this by decreasing the variance of the normal distribution used to generate the error term \(\epsilon\) in (b). Describe your results.
eps <- rnorm(100, 0, 0.1)
y <- -1 + 0.5*x + eps
lm.fit3 <- lm(y ~ x)
summary(lm.fit3)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.291411 -0.048230 -0.004533 0.064924 0.264157
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.99726 0.01047 -95.25 <2e-16 ***
## x 0.50212 0.01163 43.17 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1039 on 98 degrees of freedom
## Multiple R-squared: 0.9501, Adjusted R-squared: 0.9495
## F-statistic: 1864 on 1 and 98 DF, p-value: < 2.2e-16
plot(x,y)
abline(lm.fit3, col = "blue")
abline(-1,0.5,col = "red")
legend(x = "bottomright", legend = c("model", "true"), lwd = 3, col = c("blue", "red"))
With less noise, the coefficient and the intercept are both well within one estimated standard error of the true value. This is clear from the plot, where the linear model line and true population line are indistinguishable.
Repeat (a)–(f) after modifying the data generation process in such a way that there is more noise in the data. The model (3.39) should remain the same. You can do this by increasing the variance of the normal distribution used to generate the error term ε in (b). Describe your results.
eps <- rnorm(100, 0, 0.5)
y <- -1 + 0.5*x + eps
lm.fit4 <- lm(y ~ x)
summary(lm.fit4)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.25813 -0.27262 -0.01888 0.33644 0.93944
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.97117 0.05014 -19.369 < 2e-16 ***
## x 0.47216 0.05569 8.478 2.4e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4977 on 98 degrees of freedom
## Multiple R-squared: 0.4231, Adjusted R-squared: 0.4172
## F-statistic: 71.87 on 1 and 98 DF, p-value: 2.4e-13
plot(x,y)
abline(lm.fit4, col = "blue")
abline(-1,0.5,col = "red")
legend(x = "bottomright", legend = c("model", "true"), lwd = 3, col = c("blue", "red"))
Although the estimated intercept and coefficient are still close to the true values, the p-value on the coefficient is larger than in the previous two cases. The plot shows a worse fit than the first model, with a visibly different slope and intercept compared to the true function.
What are the confidence intervals for \(\beta_0\) and \(\beta_1\) based on the original data set, the noisier data set, and the less noisy data set? Comment on your results.
We can see that the confidence intervals are wider in the case with more noise and narrower with less noise. More noise therefore diminishes the power of hypothesis tests.
confint(lm.fit)
## 2.5 % 97.5 %
## (Intercept) -1.0575402 -0.9613061
## x 0.4462897 0.5531801
confint(lm.fit3)
## 2.5 % 97.5 %
## (Intercept) -1.0180413 -0.9764850
## x 0.4790377 0.5251957
confint(lm.fit4)
## 2.5 % 97.5 %
## (Intercept) -1.070670 -0.8716647
## x 0.361636 0.5826779
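The dependence of interval width on the noise level can also be sketched directly by refitting at several error standard deviations (fresh draws, so these widths will differ slightly from the intervals above):

```r
# Width of the 95% CI for the slope grows with the error standard deviation.
set.seed(1)
x <- rnorm(100)
widths <- sapply(c(0.1, 0.25, 0.5), function(s) {
  y <- -1 + 0.5 * x + rnorm(100, 0, s)
  diff(confint(lm(y ~ x))["x", ])   # upper minus lower bound
})
round(widths, 3)                    # increasing in s
```

This mirrors the theory: the standard error of the slope is proportional to the residual standard deviation, so doubling the noise roughly doubles the interval width.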
This problem focuses on the collinearity problem.
Perform the following commands in R:
set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100)/10
y <- 2 + 2*x1 + 0.3*x2 + rnorm(100)
The last line corresponds to creating a linear model in which y is a function of x1 and x2. Write out the form of the linear model. What are the regression coefficients?
\[Y = 2 + 2X_1 + 0.3X_2 + \epsilon\] The regression coefficients are \(\beta_0=2\), \(\beta_1=2\) and \(\beta_2=0.3\).
What is the correlation between x1 and x2? Create a scatterplot displaying the relationship between the variables.
The two features are highly correlated, with a clear linear relationship visible in the scatterplot.
cor(x1,x2)
## [1] 0.8351212
plot(x1,x2)
Using this data, fit a least squares regression to predict y using x1 and x2. Describe the results obtained. What are \(\beta_0\), \(\beta_1\), and \(\beta_2\) ? How do these relate to the true \(\beta_0\), \(\beta_1\), and \(\beta_2\)? Can you reject the null hypothesis \(H_0:\beta_1=0\)? How about the null hypothesis \(H_0:\beta_2=0\)?
lm.fit1 <- lm(y ~ x1 + x2)
summary(lm.fit1)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8311 -0.7273 -0.0537 0.6338 2.3359
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1305 0.2319 9.188 7.61e-15 ***
## x1 1.4396 0.7212 1.996 0.0487 *
## x2 1.0097 1.1337 0.891 0.3754
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared: 0.2088, Adjusted R-squared: 0.1925
## F-statistic: 12.8 on 2 and 97 DF, p-value: 1.164e-05
The intercept is close to the true value of 2, but the coefficient estimates are quite far from their true values - this is reflected in their p-values and t-statistics, which suggest that we cannot reject the null hypothesis for the x2 estimate. The coefficient for x1 is significant at the 5% level (p = 0.0487), so we can marginally reject the null for the x1 estimate, though zero lies only just outside its 95% confidence interval. The F-statistic is relatively high, with a low corresponding p-value, suggesting that the predictors are jointly related to the response.
Now fit a least squares regression to predict y using only x1. Comment on your results. Can you reject the null hypothesis \(H_0:\beta_1=0\)?
lm.fit2 <- lm(y ~ x1)
summary(lm.fit2)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.89495 -0.66874 -0.07785 0.59221 2.45560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1124 0.2307 9.155 8.27e-15 ***
## x1 1.9759 0.3963 4.986 2.66e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared: 0.2024, Adjusted R-squared: 0.1942
## F-statistic: 24.86 on 1 and 98 DF, p-value: 2.661e-06
In this model, the coefficient on x1 is highly significant, so we can reject the null hypothesis that \(\beta_1=0\). The estimate for the x1 coefficient is very close to its true value.
Now fit a least squares regression to predict y using only x2. Comment on your results. Can you reject the null hypothesis \(H_0:\beta_1=0\)?
lm.fit3 <- lm(y ~ x2)
summary(lm.fit3)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.62687 -0.75156 -0.03598 0.72383 2.44890
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3899 0.1949 12.26 < 2e-16 ***
## x2 2.8996 0.6330 4.58 1.37e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared: 0.1763, Adjusted R-squared: 0.1679
## F-statistic: 20.98 on 1 and 98 DF, p-value: 1.366e-05
Similarly to the previous model, the coefficient estimate for x2 is highly significant, so we can reject the null hypothesis. The estimate is, however, quite different from the one in the two-predictor model: because x2 is strongly correlated with x1, its coefficient here also absorbs the effect of x1 on the response.
Do the results obtained in (c)–(e) contradict each other? Explain your answer.
These results do not contradict each other, because the two predictors are highly correlated. Collinearity affected the first model: when two predictors move together this closely, it is hard to separate their individual effects on the response. In the presence of collinearity the coefficient standard errors grow, which lowers the t-statistics and raises the p-values. When one of the predictors is removed, this problem disappears, so an estimate can be statistically significant in a single-predictor fit even when it was not in the two-predictor fit.
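The degree of collinearity can be quantified with variance inflation factors. A minimal sketch, assuming the car package is available (for a two-predictor model the VIF can also be computed by hand from the correlation):

```r
library(car)             # assumes the car package is installed
vif(lm.fit1)             # VIFs for x1 and x2
1 / (1 - cor(x1, x2)^2)  # equivalent hand calculation for two predictors
```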
Now suppose we obtain one additional observation, which was unfortunately mismeasured.
x1 <- c(x1, 0.1)
x2 <- c(x2, 0.8)
y <- c(y, 6)
Re-fit the linear models from (c) to (e) using this new data. What effect does this new observation have on each of the models? In each model, is this observation an outlier? A high-leverage point? Both? Explain your answers.
lm.fit4 <- lm(y ~ x1 + x2)
summary(lm.fit4)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.73348 -0.69318 -0.05263 0.66385 2.30619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2267 0.2314 9.624 7.91e-16 ***
## x1 0.5394 0.5922 0.911 0.36458
## x2 2.5146 0.8977 2.801 0.00614 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.075 on 98 degrees of freedom
## Multiple R-squared: 0.2188, Adjusted R-squared: 0.2029
## F-statistic: 13.72 on 2 and 98 DF, p-value: 5.564e-06
In this case, we can reject the null hypothesis that \(\beta_2 = 0\), as the coefficient is significant at the 1% level, but we cannot reject the null for \(\beta_1\). The coefficient estimates and their significance change considerably with the addition of this single extra observation, which implies that it has high leverage. The \(R^2\) for this model is higher than for the original two-predictor fit, suggesting a better fit to the data.
lm.fit5 <- lm(y ~ x1)
lm.fit6 <- lm(y ~ x2)
summary(lm.fit5)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8897 -0.6556 -0.0909 0.5682 3.5665
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2569 0.2390 9.445 1.78e-15 ***
## x1 1.7657 0.4124 4.282 4.29e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.111 on 99 degrees of freedom
## Multiple R-squared: 0.1562, Adjusted R-squared: 0.1477
## F-statistic: 18.33 on 1 and 99 DF, p-value: 4.295e-05
summary(lm.fit6)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.64729 -0.71021 -0.06899 0.72699 2.38074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3451 0.1912 12.264 < 2e-16 ***
## x2 3.1190 0.6040 5.164 1.25e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.074 on 99 degrees of freedom
## Multiple R-squared: 0.2122, Adjusted R-squared: 0.2042
## F-statistic: 26.66 on 1 and 99 DF, p-value: 1.253e-06
Similarly to the first set of models, we can reject the null hypothesis in both single-predictor models. As with the two-variable model, the coefficient estimates are quite different from those in the original fits. The \(R^2\) for the x1 model is considerably lower, and that for the x2 model higher, than with the original data.
par(mfrow = c(2,2))
plot(lm.fit4)
As expected, the new observation has very high leverage, as shown in the residuals vs leverage diagnostic plot.
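The leverage of the new point can also be checked numerically. A sketch assuming the mismeasured observation is the 101st row of the data:

```r
lev <- hatvalues(lm.fit4)  # leverage statistics for the refitted two-predictor model
lev[101]                   # leverage of the new observation (index assumed to be 101)
mean(lev)                  # average leverage is (p + 1)/n = 3/101
which.max(lev)             # identifies the highest-leverage observation
```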
par(mfrow = c(2,2))
plot(lm.fit5)
In the second refitted model, the new observation does not have especially high leverage, but its residual is large, so it may be an outlier.
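A common rule of thumb flags observations whose studentized residuals exceed about 3 in absolute value as possible outliers. A quick check, again assuming the new point is observation 101:

```r
rstudent(lm.fit5)[101]             # studentized residual of the new observation
which(abs(rstudent(lm.fit5)) > 3)  # any observations flagged as outliers
```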
par(mfrow = c(2,2))
plot(lm.fit6)
In the final model, the new observation has high leverage relative to the other observations but to nowhere near the extent of the two-variable model, and is not an outlier given its small residual.
These results suggest that in the presence of collinearity, linear model fits are more susceptible to problematic data points that could be outliers or have high leverage.
This problem involves the Boston data set, which we saw in the lab for this chapter. We will now try to predict per capita crime rate using the other variables in this data set. In other words, per capita crime rate is the response, and the other variables are the predictors.
For each predictor, fit a simple linear regression model to predict the response. Describe your results. In which of the models is there a statistically significant association between the predictor and the response? Create some plots to back up your assertions.
single.lm <- lapply(names(Boston)[-1], function(x) lm(Boston[["crim"]] ~ Boston[[x]]))
names(single.lm) <- names(Boston)[-1]
lapply(single.lm, summary)
## $zn
##
## Call:
## lm(formula = Boston[["crim"]] ~ Boston[[x]])
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.429 -4.222 -2.620 1.250 84.523
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.45369 0.41722 10.675 < 2e-16 ***
## Boston[[x]] -0.07393 0.01609 -4.594 5.51e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.435 on 504 degrees of freedom
## Multiple R-squared: 0.04019, Adjusted R-squared: 0.03828
## F-statistic: 21.1 on 1 and 504 DF, p-value: 5.506e-06
##
##
## $indus
##
## Call:
## lm(formula = Boston[["crim"]] ~ Boston[[x]])
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.972 -2.698 -0.736 0.712 81.813
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.06374 0.66723 -3.093 0.00209 **
## Boston[[x]] 0.50978 0.05102 9.991 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.866 on 504 degrees of freedom
## Multiple R-squared: 0.1653, Adjusted R-squared: 0.1637
## F-statistic: 99.82 on 1 and 504 DF, p-value: < 2.2e-16
##
##
## $chas
##
## Call:
## lm(formula = Boston[["crim"]] ~ Boston[[x]])
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.738 -3.661 -3.435 0.018 85.232
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.7444 0.3961 9.453 <2e-16 ***
## Boston[[x]] -1.8928 1.5061 -1.257 0.209
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.597 on 504 degrees of freedom
## Multiple R-squared: 0.003124, Adjusted R-squared: 0.001146
## F-statistic: 1.579 on 1 and 504 DF, p-value: 0.2094
##
##
## $nox
##
## Call:
## lm(formula = Boston[["crim"]] ~ Boston[[x]])
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.371 -2.738 -0.974 0.559 81.728
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13.720 1.699 -8.073 5.08e-15 ***
## Boston[[x]] 31.249 2.999 10.419 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.81 on 504 degrees of freedom
## Multiple R-squared: 0.1772, Adjusted R-squared: 0.1756
## F-statistic: 108.6 on 1 and 504 DF, p-value: < 2.2e-16
##
##
## $rm
##
## Call:
## lm(formula = Boston[["crim"]] ~ Boston[[x]])
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.604 -3.952 -2.654 0.989 87.197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.482 3.365 6.088 2.27e-09 ***
## Boston[[x]] -2.684 0.532 -5.045 6.35e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.401 on 504 degrees of freedom
## Multiple R-squared: 0.04807, Adjusted R-squared: 0.04618
## F-statistic: 25.45 on 1 and 504 DF, p-value: 6.347e-07
##
##
## $age
##
## Call:
## lm(formula = Boston[["crim"]] ~ Boston[[x]])
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.789 -4.257 -1.230 1.527 82.849
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.77791 0.94398 -4.002 7.22e-05 ***
## Boston[[x]] 0.10779 0.01274 8.463 2.85e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.057 on 504 degrees of freedom
## Multiple R-squared: 0.1244, Adjusted R-squared: 0.1227
## F-statistic: 71.62 on 1 and 504 DF, p-value: 2.855e-16
##
##
## $dis
##
## Call:
## lm(formula = Boston[["crim"]] ~ Boston[[x]])
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.708 -4.134 -1.527 1.516 81.674
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.4993 0.7304 13.006 <2e-16 ***
## Boston[[x]] -1.5509 0.1683 -9.213 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.965 on 504 degrees of freedom
## Multiple R-squared: 0.1441, Adjusted R-squared: 0.1425
## F-statistic: 84.89 on 1 and 504 DF, p-value: < 2.2e-16
##
##
## $rad
##
## Call:
## lm(formula = Boston[["crim"]] ~ Boston[[x]])
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.164 -1.381 -0.141 0.660 76.433
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.28716 0.44348 -5.157 3.61e-07 ***
## Boston[[x]] 0.61791 0.03433 17.998 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.718 on 504 degrees of freedom
## Multiple R-squared: 0.3913, Adjusted R-squared: 0.39
## F-statistic: 323.9 on 1 and 504 DF, p-value: < 2.2e-16
##
##
## $tax
##
## Call:
## lm(formula = Boston[["crim"]] ~ Boston[[x]])
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.513 -2.738 -0.194 1.065 77.696
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.528369 0.815809 -10.45 <2e-16 ***
## Boston[[x]] 0.029742 0.001847 16.10 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.997 on 504 degrees of freedom
## Multiple R-squared: 0.3396, Adjusted R-squared: 0.3383
## F-statistic: 259.2 on 1 and 504 DF, p-value: < 2.2e-16
##
##
## $ptratio
##
## Call:
## lm(formula = Boston[["crim"]] ~ Boston[[x]])
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.654 -3.985 -1.912 1.825 83.353
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.6469 3.1473 -5.607 3.40e-08 ***
## Boston[[x]] 1.1520 0.1694 6.801 2.94e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.24 on 504 degrees of freedom
## Multiple R-squared: 0.08407, Adjusted R-squared: 0.08225
## F-statistic: 46.26 on 1 and 504 DF, p-value: 2.943e-11
##
##
## $black
##
## Call:
## lm(formula = Boston[["crim"]] ~ Boston[[x]])
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.756 -2.299 -2.095 -1.296 86.822
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.553529 1.425903 11.609 <2e-16 ***
## Boston[[x]] -0.036280 0.003873 -9.367 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.946 on 504 degrees of freedom
## Multiple R-squared: 0.1483, Adjusted R-squared: 0.1466
## F-statistic: 87.74 on 1 and 504 DF, p-value: < 2.2e-16
##
##
## $lstat
##
## Call:
## lm(formula = Boston[["crim"]] ~ Boston[[x]])
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.925 -2.822 -0.664 1.079 82.862
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.33054 0.69376 -4.801 2.09e-06 ***
## Boston[[x]] 0.54880 0.04776 11.491 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.664 on 504 degrees of freedom
## Multiple R-squared: 0.2076, Adjusted R-squared: 0.206
## F-statistic: 132 on 1 and 504 DF, p-value: < 2.2e-16
##
##
## $medv
##
## Call:
## lm(formula = Boston[["crim"]] ~ Boston[[x]])
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.071 -4.022 -2.343 1.298 80.957
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.79654 0.93419 12.63 <2e-16 ***
## Boston[[x]] -0.36316 0.03839 -9.46 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.934 on 504 degrees of freedom
## Multiple R-squared: 0.1508, Adjusted R-squared: 0.1491
## F-statistic: 89.49 on 1 and 504 DF, p-value: < 2.2e-16
zn (proportion of residential land zoned for lots over 25,000 sq ft) appears to have a significant, negative association with crim judging from the low p-value.
indus (proportion of non-retail business acres per town) appears to have a significant, positive association with crim judging from the low p-value.
chas (1 if the tract bounds the river and 0 otherwise) does not appear to have a significant association with crim.
nox (nitrogen oxides concentration) appears to have a significant, positive association with crim judging from the low p-value.
rm (average number of rooms per dwelling) appears to have a significant, negative association with crim.
age (proportion of owner-occupied units built prior to 1940) appears to have a significant, positive association with crim.
dis (weighted mean of distances to five Boston employment centres) appears to have a significant, negative association with crim.
rad (index of accessibility to radial highways) appears to have a significant, positive association with crim.
tax (full-value property-tax rate per $10,000) appears to have a significant, positive association with crim.
ptratio (pupil-teacher ratio by town) appears to have a significant, positive association with crim.
black (\(1000(Bk - 0.63)^2\) where \(Bk\) is the proportion of blacks by town) appears to have a significant, negative association with crim.
lstat (lower status of the population) appears to have a significant, positive association with crim.
medv (median value of owner-occupied homes in $1000s) appears to have a significant, negative association with crim.
All variables except chas appear to have a significant association with crim in the simple regressions.
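Rather than reading each summary individually, the slope p-values can be extracted programmatically from the single.lm list built above:

```r
# p-value on the slope from each simple regression
pvals <- sapply(single.lm, function(m) summary(m)$coefficients[2, 4])
sort(pvals)
names(pvals)[pvals >= 0.05]  # only chas should fail the 5% threshold
```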
par(mfrow = c(4,3))
for (x in names(Boston)[which(!names(Boston) %in% c("chas", "crim"))]) {
  plot(Boston[, x], Boston[, "crim"], xlab = x, ylab = "crim")
}
The scatter plots above are consistent with the findings from the simple regressions, although some apparent relationships may be artefacts of the data, such as the spikes at specific values of rad, tax and ptratio, where many towns share the same recorded value.
Fit a multiple regression model to predict the response using all of the predictors. Describe your results. For which predictors can we reject the null hypothesis \(H_0:\beta_j=0\)?
full.lm <- lm(crim ~ ., data = Boston)
summary(full.lm)
##
## Call:
## lm(formula = crim ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.924 -2.120 -0.353 1.019 75.051
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.033228 7.234903 2.354 0.018949 *
## zn 0.044855 0.018734 2.394 0.017025 *
## indus -0.063855 0.083407 -0.766 0.444294
## chas -0.749134 1.180147 -0.635 0.525867
## nox -10.313535 5.275536 -1.955 0.051152 .
## rm 0.430131 0.612830 0.702 0.483089
## age 0.001452 0.017925 0.081 0.935488
## dis -0.987176 0.281817 -3.503 0.000502 ***
## rad 0.588209 0.088049 6.680 6.46e-11 ***
## tax -0.003780 0.005156 -0.733 0.463793
## ptratio -0.271081 0.186450 -1.454 0.146611
## black -0.007538 0.003673 -2.052 0.040702 *
## lstat 0.126211 0.075725 1.667 0.096208 .
## medv -0.198887 0.060516 -3.287 0.001087 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.439 on 492 degrees of freedom
## Multiple R-squared: 0.454, Adjusted R-squared: 0.4396
## F-statistic: 31.47 on 13 and 492 DF, p-value: < 2.2e-16
In the model with all predictors, we can reject the null hypothesis that the coefficient is zero for dis, rad, medv, zn and black, using the 5% significance level as the threshold. The coefficients on these variables are not all consistent with the simple regressions: for example, zn had a significant but small negative relationship with crim in the simple model, but its coefficient is positive in the multiple regression. Both nox and lstat are significant at the 10% level; the estimate for nox differs dramatically between the two models, being -10.31 in the multiple regression and 31.25 in the simple regression.
How do your results from (a) compare to your results from (b)? Create a plot displaying the univariate regression coefficients from (a) on the x-axis, and the multiple regression coefficients from (b) on the y-axis. That is, each predictor is displayed as a single point in the plot. Its coefficient in a simple linear regression model is shown on the x-axis, and its coefficient estimate in the multiple linear regression model is shown on the y-axis.
coef_vector <- vapply(single.lm, function(x) coefficients(x)[2], 1.0)
plot(coef_vector, coefficients(full.lm)[-1], xlab = "Univariate coefs", ylab = "Multivariate coefs")
The point in the bottom right of the plot is nox.
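Labelling the points avoids having to identify them by eye; a sketch reusing coef_vector and full.lm from above:

```r
plot(coef_vector, coefficients(full.lm)[-1],
     xlab = "Univariate coefs", ylab = "Multivariate coefs")
# add the predictor name above each point
text(coef_vector, coefficients(full.lm)[-1], labels = names(single.lm), pos = 3)
```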
Is there evidence of non-linear association between any of the predictors and the response? To answer this question, for each predictor X, fit a model of the form \[
Y=\beta_0+\beta_1X+\beta_2X^2+\beta_3X^3+\epsilon
\] Fitting models for all variables except chas for which this would not be meaningful:
var_names <- names(Boston)[which(!names(Boston) %in% c("chas", "crim"))]
poly.lm <- lapply(var_names, function(x) lm(Boston[["crim"]] ~ poly(Boston[[x]], 3)))
names(poly.lm) <- var_names
lapply(poly.lm, summary)
## $zn
##
## Call:
## lm(formula = Boston[["crim"]] ~ poly(Boston[[x]], 3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.821 -4.614 -1.294 0.473 84.130
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6135 0.3722 9.709 < 2e-16 ***
## poly(Boston[[x]], 3)1 -38.7498 8.3722 -4.628 4.7e-06 ***
## poly(Boston[[x]], 3)2 23.9398 8.3722 2.859 0.00442 **
## poly(Boston[[x]], 3)3 -10.0719 8.3722 -1.203 0.22954
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.372 on 502 degrees of freedom
## Multiple R-squared: 0.05824, Adjusted R-squared: 0.05261
## F-statistic: 10.35 on 3 and 502 DF, p-value: 1.281e-06
##
##
## $indus
##
## Call:
## lm(formula = Boston[["crim"]] ~ poly(Boston[[x]], 3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.278 -2.514 0.054 0.764 79.713
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.614 0.330 10.950 < 2e-16 ***
## poly(Boston[[x]], 3)1 78.591 7.423 10.587 < 2e-16 ***
## poly(Boston[[x]], 3)2 -24.395 7.423 -3.286 0.00109 **
## poly(Boston[[x]], 3)3 -54.130 7.423 -7.292 1.2e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.423 on 502 degrees of freedom
## Multiple R-squared: 0.2597, Adjusted R-squared: 0.2552
## F-statistic: 58.69 on 3 and 502 DF, p-value: < 2.2e-16
##
##
## $nox
##
## Call:
## lm(formula = Boston[["crim"]] ~ poly(Boston[[x]], 3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.110 -2.068 -0.255 0.739 78.302
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6135 0.3216 11.237 < 2e-16 ***
## poly(Boston[[x]], 3)1 81.3720 7.2336 11.249 < 2e-16 ***
## poly(Boston[[x]], 3)2 -28.8286 7.2336 -3.985 7.74e-05 ***
## poly(Boston[[x]], 3)3 -60.3619 7.2336 -8.345 6.96e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.234 on 502 degrees of freedom
## Multiple R-squared: 0.297, Adjusted R-squared: 0.2928
## F-statistic: 70.69 on 3 and 502 DF, p-value: < 2.2e-16
##
##
## $rm
##
## Call:
## lm(formula = Boston[["crim"]] ~ poly(Boston[[x]], 3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.485 -3.468 -2.221 -0.015 87.219
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6135 0.3703 9.758 < 2e-16 ***
## poly(Boston[[x]], 3)1 -42.3794 8.3297 -5.088 5.13e-07 ***
## poly(Boston[[x]], 3)2 26.5768 8.3297 3.191 0.00151 **
## poly(Boston[[x]], 3)3 -5.5103 8.3297 -0.662 0.50858
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.33 on 502 degrees of freedom
## Multiple R-squared: 0.06779, Adjusted R-squared: 0.06222
## F-statistic: 12.17 on 3 and 502 DF, p-value: 1.067e-07
##
##
## $age
##
## Call:
## lm(formula = Boston[["crim"]] ~ poly(Boston[[x]], 3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.762 -2.673 -0.516 0.019 82.842
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6135 0.3485 10.368 < 2e-16 ***
## poly(Boston[[x]], 3)1 68.1820 7.8397 8.697 < 2e-16 ***
## poly(Boston[[x]], 3)2 37.4845 7.8397 4.781 2.29e-06 ***
## poly(Boston[[x]], 3)3 21.3532 7.8397 2.724 0.00668 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.84 on 502 degrees of freedom
## Multiple R-squared: 0.1742, Adjusted R-squared: 0.1693
## F-statistic: 35.31 on 3 and 502 DF, p-value: < 2.2e-16
##
##
## $dis
##
## Call:
## lm(formula = Boston[["crim"]] ~ poly(Boston[[x]], 3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.757 -2.588 0.031 1.267 76.378
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6135 0.3259 11.087 < 2e-16 ***
## poly(Boston[[x]], 3)1 -73.3886 7.3315 -10.010 < 2e-16 ***
## poly(Boston[[x]], 3)2 56.3730 7.3315 7.689 7.87e-14 ***
## poly(Boston[[x]], 3)3 -42.6219 7.3315 -5.814 1.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.331 on 502 degrees of freedom
## Multiple R-squared: 0.2778, Adjusted R-squared: 0.2735
## F-statistic: 64.37 on 3 and 502 DF, p-value: < 2.2e-16
##
##
## $rad
##
## Call:
## lm(formula = Boston[["crim"]] ~ poly(Boston[[x]], 3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.381 -0.412 -0.269 0.179 76.217
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6135 0.2971 12.164 < 2e-16 ***
## poly(Boston[[x]], 3)1 120.9074 6.6824 18.093 < 2e-16 ***
## poly(Boston[[x]], 3)2 17.4923 6.6824 2.618 0.00912 **
## poly(Boston[[x]], 3)3 4.6985 6.6824 0.703 0.48231
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.682 on 502 degrees of freedom
## Multiple R-squared: 0.4, Adjusted R-squared: 0.3965
## F-statistic: 111.6 on 3 and 502 DF, p-value: < 2.2e-16
##
##
## $tax
##
## Call:
## lm(formula = Boston[["crim"]] ~ poly(Boston[[x]], 3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.273 -1.389 0.046 0.536 76.950
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6135 0.3047 11.860 < 2e-16 ***
## poly(Boston[[x]], 3)1 112.6458 6.8537 16.436 < 2e-16 ***
## poly(Boston[[x]], 3)2 32.0873 6.8537 4.682 3.67e-06 ***
## poly(Boston[[x]], 3)3 -7.9968 6.8537 -1.167 0.244
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.854 on 502 degrees of freedom
## Multiple R-squared: 0.3689, Adjusted R-squared: 0.3651
## F-statistic: 97.8 on 3 and 502 DF, p-value: < 2.2e-16
##
##
## $ptratio
##
## Call:
## lm(formula = Boston[["crim"]] ~ poly(Boston[[x]], 3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.833 -4.146 -1.655 1.408 82.697
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.614 0.361 10.008 < 2e-16 ***
## poly(Boston[[x]], 3)1 56.045 8.122 6.901 1.57e-11 ***
## poly(Boston[[x]], 3)2 24.775 8.122 3.050 0.00241 **
## poly(Boston[[x]], 3)3 -22.280 8.122 -2.743 0.00630 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.122 on 502 degrees of freedom
## Multiple R-squared: 0.1138, Adjusted R-squared: 0.1085
## F-statistic: 21.48 on 3 and 502 DF, p-value: 4.171e-13
##
##
## $black
##
## Call:
## lm(formula = Boston[["crim"]] ~ poly(Boston[[x]], 3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.096 -2.343 -2.128 -1.439 86.790
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6135 0.3536 10.218 <2e-16 ***
## poly(Boston[[x]], 3)1 -74.4312 7.9546 -9.357 <2e-16 ***
## poly(Boston[[x]], 3)2 5.9264 7.9546 0.745 0.457
## poly(Boston[[x]], 3)3 -4.8346 7.9546 -0.608 0.544
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.955 on 502 degrees of freedom
## Multiple R-squared: 0.1498, Adjusted R-squared: 0.1448
## F-statistic: 29.49 on 3 and 502 DF, p-value: < 2.2e-16
##
##
## $lstat
##
## Call:
## lm(formula = Boston[["crim"]] ~ poly(Boston[[x]], 3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.234 -2.151 -0.486 0.066 83.353
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6135 0.3392 10.654 <2e-16 ***
## poly(Boston[[x]], 3)1 88.0697 7.6294 11.543 <2e-16 ***
## poly(Boston[[x]], 3)2 15.8882 7.6294 2.082 0.0378 *
## poly(Boston[[x]], 3)3 -11.5740 7.6294 -1.517 0.1299
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.629 on 502 degrees of freedom
## Multiple R-squared: 0.2179, Adjusted R-squared: 0.2133
## F-statistic: 46.63 on 3 and 502 DF, p-value: < 2.2e-16
##
##
## $medv
##
## Call:
## lm(formula = Boston[["crim"]] ~ poly(Boston[[x]], 3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.427 -1.976 -0.437 0.439 73.655
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.614 0.292 12.374 < 2e-16 ***
## poly(Boston[[x]], 3)1 -75.058 6.569 -11.426 < 2e-16 ***
## poly(Boston[[x]], 3)2 88.086 6.569 13.409 < 2e-16 ***
## poly(Boston[[x]], 3)3 -48.033 6.569 -7.312 1.05e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.569 on 502 degrees of freedom
## Multiple R-squared: 0.4202, Adjusted R-squared: 0.4167
## F-statistic: 121.3 on 3 and 502 DF, p-value: < 2.2e-16
Again assuming we are interested in significance at the 5% level:
medv, ptratio, dis, age, nox and indus all have significant cubic coefficient estimates;
lstat, tax, rad, rm and zn all have significant squared coefficient estimates;
black is the only predictor with no evidence of non-linearity, with neither the squared nor the cubic term being significant.
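The polynomial terms can also be assessed jointly with an F-test comparing the linear and cubic fits for a given predictor, e.g. medv (a sketch; a small p-value favours the cubic model):

```r
fit.lin <- lm(crim ~ medv, data = Boston)
fit.cub <- lm(crim ~ poly(medv, 3), data = Boston)
anova(fit.lin, fit.cub)  # joint F-test of the quadratic and cubic terms
```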